Web Page Downloading and Classification
Authors
Abstract
This paper describes the processes of downloading and classifying Web-based articles in online medical journals as a preliminary step to extracting bibliographic data to populate MEDLINE, the widely used database of the National Library of Medicine (NLM). The two processes are combined in an automated system named “Web Page Downloading and Classification”. The system downloads Web pages using WinInet, Microsoft’s Windows Internet API, together with several Artificial Intelligence (AI) techniques, including a Breadth-First search algorithm and a Constraint Satisfaction method. These two techniques are used to traverse each page’s links, classify the linked pages as abstract, full-text, PDF, or image files, and recognize and generate the successors of the pages being downloaded.
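The abstract does not include pseudocode, but the breadth-first link traversal with page classification it describes might be sketched as below. The URL-suffix rules in `classify` and the `links_of` callback are illustrative assumptions standing in for the paper's actual constraint-satisfaction rules and WinInet-based fetching:

```python
from collections import deque
from urllib.parse import urljoin, urlparse

# Hypothetical classification rules: label a link by its URL suffix or path.
PAGE_TYPES = {".pdf": "pdf", ".gif": "image", ".jpg": "image", ".png": "image"}

def classify(url: str) -> str:
    """Label a URL as abstract, full text, PDF, or image (illustrative rules)."""
    path = urlparse(url).path.lower()
    for suffix, label in PAGE_TYPES.items():
        if path.endswith(suffix):
            return label
    if "abstract" in path:
        return "abstract"
    return "fulltext"

def bfs_crawl(start_url: str, links_of, max_pages: int = 100) -> dict:
    """Breadth-first traversal: links_of(url) returns a page's outgoing links."""
    queue = deque([start_url])
    seen = {start_url}
    labels = {}
    while queue and len(labels) < max_pages:
        url = queue.popleft()
        labels[url] = classify(url)          # classify the current page
        for link in links_of(url):
            absolute = urljoin(url, link)    # resolve relative links
            if absolute not in seen:         # generate unseen successors
                seen.add(absolute)
                queue.append(absolute)
    return labels
```

The FIFO queue gives the breadth-first order; replacing `links_of` with a real HTTP fetch and HTML link extractor would turn the sketch into a working crawler.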
Similar Articles
A Novel Approach to Feature Selection Using PageRank algorithm for Web Page Classification
In this paper, a novel filter-based approach is proposed using the PageRank algorithm to select the optimal subset of features, as well as to compute their weights, for web page classification. To evaluate the proposed approach, multiple experiments are performed using accuracy score as the main criterion on four different datasets, namely WebKB, Reuters-R8, Reuters-R52, and 20NewsGroups. By analy...
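The core of such an approach is running PageRank over a graph of features; how that graph is built from the datasets is not given in the teaser, so the power-iteration sketch below assumes an already-constructed feature adjacency mapping and simply returns the resulting scores as feature weights:

```python
def pagerank(adj, damping=0.85, iters=50):
    """Power iteration over a directed feature graph.

    adj maps each node to a list of nodes it links to (assumed input:
    e.g. a feature co-occurrence graph). Returns a score per node,
    usable as a feature weight or a ranking for feature selection.
    """
    nodes = list(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}           # uniform start
    for _ in range(iters):
        new = {v: (1 - damping) / n for v in nodes}  # teleport term
        for v in nodes:
            out = adj[v]
            if not out:
                continue                          # dangling node: no mass to pass
            share = damping * rank[v] / len(out)
            for w in out:
                new[w] += share                   # distribute rank along out-links
        rank = new
    return rank
```

Selecting the top-k features by score then yields the reduced feature subset.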
Demographic and motivation variables associated with Internet usage activities
Examines demographic variables (gender, age, educational level) and motivation variables (perceived ease of use, perceived enjoyment, perceived usefulness) associated with Internet usage activities (defined in terms of messaging, browsing, downloading and purchasing). A total of 1,370 usable responses were obtained using a Web page survey. Results showed that males are more likely to engage in ...
Automated Article Links Identification for Web-based Online Medical Journals
As part of research into Web-based document analysis, including Web page downloading and classification, an algorithm has been developed to automatically identify article links in Web-based online journals. This algorithm is based on feature vectors calculated from attributes and contents of links extracted from HTML files, and an instance-based learning algorithm using a nearest neighbor methodo...
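An instance-based, nearest-neighbor classifier of this kind can be sketched in a few lines; the two-component feature vectors below are hypothetical stand-ins for the link attributes (e.g. anchor-text statistics) the paper extracts from HTML:

```python
import math

def nearest_label(vec, examples):
    """1-nearest-neighbour classification over (feature_vector, label) pairs.

    examples is the training set of labelled link feature vectors;
    the query vec receives the label of its closest training example
    under Euclidean distance.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(examples, key=lambda ex: dist(vec, ex[0]))[1]
```

Because the model is just the stored examples, adding a newly labelled link requires no retraining, which suits incremental journal-by-journal processing.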
Prioritize the ordering of URL queue in Focused crawler
The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler it is not a simple task to download only domain-specific web pages; an unfocused approach often yields undesired results. Therefore, several new ideas have been proposed; among them, a key technique is focused crawling, which is able to crawl particular topical...
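A focused crawler prioritizes its URL frontier so the most topic-relevant pages are fetched first. A minimal best-first sketch, assuming `relevance` and `links_of` callbacks as stand-ins for a real topic classifier and page fetcher:

```python
import heapq

def focused_crawl(seed_urls, relevance, links_of, budget=50):
    """Best-first crawl: always expand the URL with the highest relevance.

    relevance(url) -> float scores a URL's topical relevance;
    links_of(url) -> list returns its outgoing links.
    heapq is a min-heap, so scores are negated to pop the best URL first.
    """
    frontier = [(-relevance(u), u) for u in seed_urls]
    heapq.heapify(frontier)
    seen = set(seed_urls)
    visited = []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)      # most relevant unvisited URL
        visited.append(url)
        for link in links_of(url):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-relevance(link), link))
    return visited
```

The `budget` parameter caps the crawl, so low-relevance URLs may never be fetched at all, which is the efficiency gain focused crawling aims for.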
Analysis of Sources of Latency in Downloading Web Pages
Why does it take so long to download a Web page from a Web server? We analyze the download latency for pages for a variety of situations, in which Web browser and server are both within the same country as well as in different countries. Our study examines several sources of latency in accessing Web pages: DNS, TCP, the Web server itself, and the network links and routers. We divide the total d...
Journal:
Volume, Issue:
Pages: -
Publication date: 2001